This notebook details my approach to building predictive models for newly released games on boardgamegeek.com (BGG). Specifically, I am interested in taking any newly released boardgame and, using features that are available at the time of its release, estimating how it will be received on BGG: its average rating, number of user ratings, and complexity rating.
While the goal of this project is ultimately to yield accurate predictions for upcoming games, we are also interested in understanding what the models learn. What features of games are associated with high/low average rating? Why do some games receive high numbers of user ratings? What types of games are the most complex?
To answer these questions, we’ll make use of historical data from boardgamegeek. We will connect to a database containing game features and their current ratings on BGG. For this analysis, in training models, we will restrict ourserlves to games published through 2019. We will validate the performance of our models by evaluating their performance in predicting games published in 2020.
The data we are using comes from boardgamegeek.com, which we access by using the open BGG API. We are training models on data that last pulled from BGG on 2022-02-21.
We will be training models at the game-level, where every row corresponds to one game and every column corresponds to a feature of the game. As of this date, there are 22071 on boardgamegeek with at least 30 user ratings.
For the purpose of this analysis, the training set will only include games 2020 that have achieved at least 100 user ratings. This is a design decision to restrict our sample to games that 1) have received some evaluation from the community and 2) speed up the time in training models. We can later view this as a parameter for tuning, allowing more or less historical games to enter the model for training. Based on some initial tests, 100 was a useful cutoff point for both model performance and training time.
We are interested in modeling a number of different outcomes: a game’s average rating, complexity rating, and number of user ratings.
These outcomes aren’t independent, as complexity and the average rating are highly correlated.
As we will see, this means if we want to predict a game’s average rating, the most important feature is usually its average weight. But because these a game’s average rating and complexity are both voted on by the BGG community, we won’t know a game’s average rating at the time of its release. This means for newly upcoming games, we will first use a model to estimate a game’s complexity and then use that estimate as the input into our average rating model.
What features do we have about games? We have basic information about every game, such as its player count and playing time, and we also have many BGG outcomes, such as the number of comments, number of people trading, which we will not use in predicting the outcomes we care about. We have some missingness present in the playing time variables that we will address in our recipe preparing the data.
We also have a variety of information about game mechanics, categories, artists, publishers, designers, artists, and so on. Some of these categories are not observed for every game, such as if a game doesn’t have expansions or integrations with other games.
This means there are ~180 different mechanics, ~7k publishers, ~ 10k designers, and ~ 11k artists present in our training set. This is good in the sense that we have ample information about games for models to look at and use in training, but bad in the sense that if we threw all of it into a model we would quickly run up against the the curse of dimensionality.
type | n_types | n_games |
publisher | 7,177 | 22,070 |
category | 84 | 21,789 |
designer | 9,978 | 21,445 |
mechanic | 182 | 20,456 |
family | 3,566 | 18,392 |
artist | 11,128 | 15,995 |
expansion | 23,528 | 5,604 |
implementation | 5,622 | 5,000 |
integration | 2,415 | 1,770 |
compilation | 671 | 863 |
How can we make use of this information for modeling? We could create dummy variables for every different type, but this will quickly create thousands of features, many of which are going to contain little information. We would view this as a P > N problem and let the data speak for itself via methods of feature selection and dimension reduction.
Alternatively, every game had only one mechanic/designer/publisher, we could mean encode on the training set. For instance, instead of using thousands of dummy variables for each designer, we would have one ‘designer_mean’ feature that is simply the value the designer’s mean value in the training set. This can dramatically reduce the dimensionality of categorical features while keeping the information we want.
For our purposes, the hang up with taking a simple mean encoding approach is that a game may have multiple designers, categories, mechanics, artists, and publishers. For designers we might be able to get by with taking the mean of the designer means, but it starts to get more complicated with mechanics - most games have multiple different mechanics, and its the combination of different mechanics that are we interested in exploring. The other complication is that some designers have only designed a handful of games, while others have designed hundreds, so the mean may not impart the same amount of information.
On top of all of this, we have to be careful in what features we allow to enter a model, as some of the categories about games are themselves a reflection of the outcomes we want to predict.
With all this in mind, we’ll do bit of inspection to figure out which features of games we’ll allow to enter our training recipe, in essence using a manual filtering method to select features.
One set of features relates to a game’s “family”, which is sort of a catch all term for various buckets that games might fall into: Kickstarters, dungeon crawls, tableau builders, etc. Some of these are likely to be very useful in training a model, while others should be omitted. We don’t, for instance, want to include whether a game has digital implementations, as these are a reflection of a game’s popularity. These sets of features also have a very long tail, with some families only having one or two games in them. We’ll filter to remove families with near zero variance, removing features on this variable that apply to a little less than 1% of games.
Some features we won’t include, such as the Mensa Select or implementations on BoardGameArena, as these are outcomes that typically occur when a game has been popular and shouldn’t be used as predictors.
We’ll do the same thing for categories, but this variable is much smaller and generally pretty well organized.
We’ll include all of these, though there will likely be some overlap between these and other features which we can take care of with a correlation filter.
Mechanics are also pretty well organized, so we don’t have to do much filtering.
We’ll just keep all of the mechanics, as these are the main features of games that we’ll focus our attention on.
How should we handle artist and designer effects? We’ll use a much lower minimum proption here, as very few designers would have designed ~ 100 games.
This amounts to allowing for designers once they have released about 15 games. We’ll more or less take the same approach for artists.
Publishers are a bit more tricky. If we look at the top rated publishers, we’ll notice something a bit odd. Some of the publishers we’ll recognize, but we also see some names that might not make a a lot of sense. Why are Asmodee Italia and Galapagos towards the top? The reason for this is foreign language publishers - once a game becomes popular enough, these games end up being published in foreign languages. This means certain publishers are a reflection of the outcome we are trying to predict (the average and bayes average), and shouldn’t be used as predictors in models of these outcomes.
So what should we do? I’ve settled on creating a ‘white-list’ for publishers, meaning publishers that have been the original publisher of popular games.
Putting this all together, we will keep the selected categorical features, creating dummy variables for each, which we will then parse down through a combination of near zero variance and correlation filters before modeling, then ultimately conducting feature selection within the modeling process.
# combine into one table
categorical_features_selected= rbindlist(mget(ls(pattern = "selected_"))) %>%
as_tibble() %>%
mutate(selected = "yes")
# select in full game types set
game_types_selected = game_types %>%
left_join(., categorical_features_selected %>%
select(type, id, value, tidied, selected),
by = c("type", "id", "value")) %>%
filter(selected == 'yes')
# pivot and spread these out
game_types_pivoted =game_types_selected %>%
select(game_id, type, value) %>%
mutate(type_abbrev = substr(type, 1, 3)) %>%
mutate(value = tolower(gsub("[[:space:]]", "_", gsub("\\s+", " ", gsub("[[:punct:]]","", value))))) %>%
mutate(type = paste(type, value, sep="_")) %>%
mutate(has_type = 1) %>%
select(-value) %>%
pivot_wider(names_from = c("type"),
values_from = c("has_type"),
id_cols = c("game_id"),
names_sep = "_",
values_fn = min,
values_fill = 0)
# now join
games_model = active_games %>%
left_join(.,
game_types_pivoted,
by = "game_id")
# remove objects we don't need
rm(train_types,
game_types_selected,
game_types_pivoted,
publishers,
categorical_features_selected,
families,
mechanics,
categories,
designers,
artists)
rm(list = ls(pattern = "selected_"))
rm(list = ls(pattern = "summarized"))
We can now proceed to building predictive models. We’ll split the data into training, validation, and test sets, then do a bit of exploratory analysis on the training set. We’ll then create recipes which we use in training and evaluating our models.
First, we’ll split the data. We’ll set up our training, validation, and test sets based on the year games are published. Our training set will be games published prior to 2020 while our main validation set will be games published in 2020. We’ll use resampling within our training set to tune our models, validating their performance on the validation set. The test set will be all games published after our validation set.
# get full dataset
games_full = games_model %>%
mutate(dataset = case_when(yearpublished <= params$end_training_year ~ 'train',
yearpublished == params$end_training_year+1 ~ 'validation',
TRUE ~ 'test')) %>%
mutate(usersrated = log(usersrated))
# filter our training set to only games with at least n ratings
games_train = games_full %>%
filter(dataset == 'train') %>%
filter(usersrated >= log(params$min_ratings))
games_validation = games_full %>%
filter(dataset == 'validation')
games_test = games_full %>%
filter(dataset == 'test')
We’re going to use the tidymodels framework for our model training and evaluation, so we’ll create custom splits around these for our workflows.
# make an initial split based on previously defined splits
validation_split = make_splits(list(analysis = seq(nrow(games_train)),
assessment = nrow(games_train) + seq(nrow(games_validation))),
data = bind_rows(games_train,
games_validation))
We can do a bit of exploratory analysis on this to help guide decisions we’ll make in our recipe, which we’ll build on the training set. We’ll look at some of the main numeric features of games, such as player counts, playingtime, and yearpublished, for each of our outcomes.
We can look at the correlation between our outcomes, the main features we have for games (playtime, player count, complexity) and some of the categorical features.
What about mechanics? Let’s look at the correlation between mechanics and these numeric features.
Correlation plot of game info and game mechanics
We can look at how specific mechanics are associated with each outcome, similaer as before, but this time looking at the density of each outcome.
We’ll make a basic recipe, which we’ll then update for each specific outcome.
# creating recipe with no formula or outcome specified yet
base_recipe = recipe(x = games_train) %>%
update_role(all_numeric(),
new_role = "predictor") %>%
step_mutate_at(c("yearpublished",
"averageweight",
"playingtime"),
fn = ~ na_if(., 0)) %>% # these variables come through as 0 if they are missing
update_role(timestamp,
dataset,
game_id,
name,
numcomments,
numweights,
owned,
trading,
wanting,
wishing,
timestamp,
average,
bayesaverage,
averageweight,
usersrated,
stddev,
new_role = "id") %>%
step_filter(!is.na(yearpublished)) %>%
step_filter(!is.na(name)) %>%
step_mutate(missing_minage = case_when(is.na(minage) ~ 1,
TRUE ~ 0)) %>%
step_mutate(missing_playtingtime = case_when(is.na(playingtime) ~ 1,
TRUE ~ 0)) %>%
step_impute_median(playingtime,
maxplayers,
minage) %>% # medianimpute numeric predictors
# step_mutate(published_prior_1950 = case_when(yearpublished<1950 ~ 1,
# TRUE ~ 0)) %>%
step_mutate(minplayers = case_when(minplayers < 1 ~ 1,
minplayers > 10 ~ 10, # truncate
TRUE ~ minplayers),
maxplayers = case_when(maxplayers < 1 ~ minplayers,
maxplayers > 20 ~ 20,
TRUE ~ maxplayers)) %>%
step_rm(minplaytime, maxplaytime) %>%
step_mutate(time_per_player = playingtime/ maxplayers) %>% # make time per player variable
step_mutate_at(starts_with("category_"),
fn = ~ replace_na(., 0)) %>%
step_mutate_at(starts_with("mechanic_"),
fn = ~ replace_na(., 0)) %>%
step_mutate_at(starts_with("artist_"),
fn = ~ replace_na(., 0)) %>%
step_mutate_at(starts_with("designer_"),
fn = ~ replace_na(., 0)) %>%
step_mutate_at(starts_with("publisher_"),
fn = ~ replace_na(., 0)) %>%
step_mutate_at(starts_with("family_"),
fn = ~ replace_na(., 0)) %>%
step_mutate(number_mechanics = rowSums(across(starts_with("mechanic_"))),
# number_artists = rowSums(across(starts_with("art_"))),
number_categories = rowSums(across(starts_with("category_")))) %>%
step_zv(all_predictors()) %>%
step_nzv(all_predictors(),
-starts_with("publisher_"),
-starts_with("artist_"),
-starts_with("designer_"),
freq_cut = 100/1) %>%
step_corr(all_predictors(),
threshold = 0.9) %>%
step_mutate(published_prior_1950 = case_when(yearpublished < 1950 ~ 1,
TRUE ~ 0)) %>%
step_mutate(trunc_yearpublished = case_when(yearpublished < 1950 ~ 1950,
TRUE ~ yearpublished)) %>% # truncate
# step_mutate(cut_yearpublished= yearpublished) %>%
# step_cut(cut_yearpublished,
# breaks = seq(1970, 2010, 10),
# include_outside_range = T) %>%
step_mutate(cut_playingtime= playingtime) %>%
step_cut(cut_playingtime,
breaks = c(15, 45, 90, 180),
include_outside_range = T) %>%
step_dummy(all_nominal_predictors()) %>%
step_log(playingtime,
time_per_player,
offset = 1) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors()) %>% # remove features with no variance
step_nzv(all_predictors(),
-starts_with("publisher_"),
-starts_with("artist_"),
-starts_with("designer_"),
freq_cut = 100/1) %>% # apply near zero variance filter
step_nzv(starts_with("artist_"),
-one_of(c("artist_ian_otoole",
"artist_chris_quilliams")), # allow for some specific artists, well known in recent years
freq_cut = 250/1) %>%
step_corr(all_predictors(),
threshold = 0.9) # remove highly, highly correlated features
For modeling the BGG average, bayesaverage, and usersrated, we’ll include averageweight as a feature and missingness within the recipe with a simple model.
# average
recipe_average = base_recipe %>%
update_role(average,
new_role = "outcome") %>%
update_role(averageweight,
new_role = "predictor") %>%
step_impute_linear(averageweight,
# impute_with = imp_vars(all_predictors())) %>%
impute_with = imp_vars(
playingtime,
time_per_player,
number_mechanics))
# bayesaverage
recipe_bayesaverage = base_recipe %>%
update_role(bayesaverage,
new_role = "outcome") %>%
update_role(averageweight,
new_role = "predictor") %>%
step_impute_linear(averageweight,
# impute_with = imp_vars(all_predictors())) %>%
impute_with = imp_vars(
playingtime,
time_per_player,
number_mechanics))
# usersrated
recipe_usersrated = base_recipe %>%
update_role(usersrated,
new_role = "outcome") %>%
update_role(averageweight,
new_role = "predictor") %>%
step_impute_linear(averageweight,
# impute_with = imp_vars(all_predictors())) %>%
impute_with = imp_vars(
playingtime,
time_per_player,
number_mechanics))
For modeling complexty (averageweight), we’ll trim the dataset further, omitting games for which we haven’t received enough votes on their complexity.
recipe_averageweight = base_recipe %>%
update_role(averageweight,
new_role = "outcome") %>%
step_filter(numweights > 10)
We’ll use a few different methods in modeling our outcome. I’ll mostly rely on penalized regression (glmnet) and xgboost (xgbTree), but I’ll also explore using bayesian linear models with Stan.
# penalized linear regression
glmnet_reg_mod<-
linear_reg(penalty = tune::tune(),
mixture = 0.5) %>%
set_engine("glmnet")
# specify grid for tuning
glmnet_grid <- tibble(penalty = 10^seq(-4, -0.5,
length.out = 30))
# xgbtree for regression
xgbTree_reg_mod <-
parsnip::boost_tree(
mode = "regression",
trees = 250,
sample_size = tune::tune(),
min_n = tune::tune(),
tree_depth = tune::tune()) %>%
set_engine("xgboost",
objective = "reg:squarederror")
# xgbTree grid
xgbTree_grid <-
expand.grid(
sample_size = c(0.5, 0.75, 0.95),
min_n = c(5, 15, 25),
tree_depth = 3
)
# stan linear regression
set.seed(123)
prior_dist <- rstanarm::student_t(df = 1) # student t prior
# prior_dist = brms::horseshoe( # regularized horeshoe
# df = 1,
# scale_global = 1,
# df_global = 1,
# scale_slab = 2,
# df_slab = 4,
# par_ratio = NULL,
# autoscale = TRUE
# )
# lm with stan
stan_reg_mod <-
linear_reg() %>%
set_engine("stan",
prior = prior_dist,
prior_intercept = prior_dist)
We then create workflows for each model, for each outcome.
For the models that rely on tuning parameters (glmnet and xgbTree) we’ll define a cross validation split within the training set and tune across resamples.
We’ll register a parallel back end and then tune models for each outcome.